NetApp: wait for volume to become RW after snapmirror break #299

Open
Carthaca wants to merge 2 commits into stable/2023.1-m3 from netapp-wait-for-rw-after-snapmirror-break

@Carthaca
Collaborator

During replica promotion two replicas failed with mount errors after
snapmirror break operations. NetApp audit logs showed that the break
commands were issued but remained in "Pending" state while Manila
immediately attempted to mount the volumes. The mounts failed because
the volumes were still DP type - the break operations hadn't completed
yet.

The break_snapmirror method previously assumed break_snapmirror_vol()
was synchronous and the volume would immediately be RW. In practice,
the break operation can take several seconds to complete.

This adds a polling loop after break_snapmirror_vol() that waits for
the volume type to transition from 'dp' to 'rw' before attempting the
mount. The implementation mirrors the existing wait_for_quiesced logic,
using the netapp_snapmirror_quiesce_timeout config with 5-second
intervals.

If the volume doesn't become RW within the timeout, a NetAppException
is raised with details about the timeout and volume name.
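
For readers following along, the wait described above could look roughly like the sketch below. It is not the code from this PR: `wait_for_volume_rw`, the `get_volume_type` callable, and the local `NetAppException` stand-in are hypothetical names, and the 300-second default only stands in for whatever netapp_snapmirror_quiesce_timeout is configured to. The `sleep` and `clock` parameters exist purely so the loop can be exercised without real waiting.

```python
import time


class NetAppException(Exception):
    """Local stand-in for the driver's exception class (assumption)."""


def wait_for_volume_rw(get_volume_type, volume_name, timeout=300, interval=5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the backend reports the volume type as 'rw'.

    get_volume_type is a hypothetical callable that takes the volume name
    and returns its current type, e.g. 'dp' or 'rw'.
    """
    deadline = clock() + timeout
    while True:
        if get_volume_type(volume_name) == 'rw':
            return
        if clock() >= deadline:
            # Include the timeout and the volume name in the error message.
            raise NetAppException(
                "Volume %s did not become RW within %s seconds after "
                "snapmirror break." % (volume_name, timeout))
        sleep(interval)
```

The type is checked before the deadline, so a volume that turns RW on the final poll is still accepted rather than reported as a timeout.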

@sumitarora2786

Just a thought: the snapmirror relationship status could also be checked for "broken-off".

@crenduchinta88

crenduchinta88 commented Mar 16, 2026

Another thought: we could mount the volume while it’s still in DP mode once promote is selected, and then perform the SnapMirror break.

@sumitarora2786

Is that allowed? As far as I remember, mounting a DP-type volume was not allowed; the junction path could only be applied once the volume was no longer DP. Maybe something changed in newer releases.

@crenduchinta88

crenduchinta88 commented Mar 16, 2026

> Is that allowed? As far as I remember, mounting a DP-type volume was not allowed; the junction path could only be applied once the volume was no longer DP. Maybe something changed in newer releases.

Yes, adding junction-path to the DP volume is possible both before and after the initial transfer. However, data access through the junction path is only permitted once the baseline copy has completed.

link

@sumitarora2786

sumitarora2786 commented Mar 17, 2026

Yes, that was the confusion. I barely remember, but during the initial transfer it used to give a warning.
Mounting once the baseline has completed should be fine.

@Carthaca, since we offer RO replicas, the mount operation must have been kicked off earlier. Is the mount after break only needed in the case where the user creates the access rule for the replica/destination?
I am wondering why the process outlined in https://codi.eu-nl-1.cloud.sap/LMtI1FUzS-iVAiFSTBirfA tries to mount after the break. Maybe the old doc needs an update, but I am still curious :)

During replica promotion replicas failed with mount errors after
snapmirror break operations. NetApp audit logs showed that the break
commands were issued but remained in "Pending" state while Manila
immediately attempted to mount the volumes. The mounts failed because
the volumes were still DP type - the break operations hadn't completed
yet.

The break_snapmirror method previously assumed break_snapmirror_vol()
was synchronous and the volume would immediately be RW. In practice,
the break operation can take several seconds to complete.

This adds a polling loop after break_snapmirror_vol() that waits for
the volume type to transition from 'dp' to 'rw' before attempting the
mount. The implementation mirrors the existing wait_for_quiesced logic,
using the netapp_snapmirror_quiesce_timeout config with 5-second
intervals.

If the volume doesn't become RW within the timeout, a NetAppException
is raised with details about the timeout and volume name.

Additionally, this optimizes promotion for readable replicas by
skipping the mount operation entirely. Readable replicas are already
mounted when created (with junction path) since they need to be
accessible for read operations. Only DR replicas need mounting after
snapmirror break.
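
A minimal sketch of the branching described above, assuming the replication type is available as the string 'readable' or 'dr'. `finish_promotion` and its two callables are hypothetical names for illustration, not the driver's real method names.

```python
def finish_promotion(replication_type, volume_name, break_snapmirror,
                     mount_volume):
    """Break the mirror, then mount only if the replica is not readable.

    break_snapmirror and mount_volume are hypothetical callables that each
    take the volume name. Returns True if a mount was performed.
    """
    break_snapmirror(volume_name)
    if replication_type == 'readable':
        # Readable replicas already received a junction path at creation
        # time, so a second mount would be redundant.
        return False
    # DR replicas were never mounted and need a junction path now.
    mount_volume(volume_name)
    return True
```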

Change-Id: I2b8f9a1c5d7e3a4f6b9c8d1e2f3a4b5c6d7e8f9a
Signed-off-by: Maurice Escher <maurice.escher@sap.com>
@Carthaca force-pushed the netapp-wait-for-rw-after-snapmirror-break branch from 6b5f68f to c314414 on March 19, 2026 13:37
@Carthaca
Collaborator Author

@sumitarora2786 yes, the additional mount is not needed in the readable replica case - I made that optimization to the code.

The problem of both sides being DP was caused by the periodic update trying to "repair" the original snapmirror (create, initialize, resume, resync in the NetApp logs); this should be avoided now, too.

@Carthaca changed the title from "NetApp: wait for volume to become RW after snapmirror break" to "WIP: NetApp: wait for volume to become RW after snapmirror break" on Mar 19, 2026
@Carthaca marked this pull request as draft on March 19, 2026 13:58
…tion

During replica promotion, the API layer sets the replica status to
STATUS_REPLICATION_CHANGE and both promote_share_replica and
_share_replica_update use the @locked_share_replica_operation
decorator to prevent concurrent execution via a shared lock.

However, after a promotion failure, a sequential race can occur:

1. API sets replica status to STATUS_REPLICATION_CHANGE
2. Promote operation acquires lock and starts
3. Promote fails (e.g., mount error after snapmirror break)
4. Exception handler sets both replicas to ERROR status
5. Promote releases lock and exits
6. periodic_share_replica_update acquires lock shortly after
7. Sees both replicas in ERROR, but no snapmirror relationship
8. Attempts to recreate snapmirror based on stale database state
9. Creates relationship in wrong direction (old source -> old dest)

This adds two safeguards in _share_replica_update:

1. Explicitly skip replicas with STATUS_REPLICATION_CHANGE status.
   While the lock prevents concurrent execution during promotion,
   this provides defense-in-depth and makes the intent explicit.
   STATUS_REPLICATION_CHANGE is intentionally NOT added to
   TRANSITIONAL_STATUSES as it has special handling in the Share
   model's instance property for replica selection ordering.

2. If both the active replica and target replica are in ERROR
   status, skip the driver update entirely. This prevents automatic
   "recovery" after failed critical operations that require manual
   intervention. Without this, periodic updates recreate snapmirror
   relationships in incorrect directions after failed promotions.

The checks are placed in the share manager (not the driver) as
they are policy decisions about when to skip automatic operations.
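
The two safeguards can be sketched as a single predicate. This is an illustration under stated assumptions, not the change itself: `should_skip_periodic_update` is a hypothetical helper, and the two constants are defined locally with the values they are assumed to have in Manila's constants module.

```python
# Assumed to mirror the values in Manila's constants module.
STATUS_ERROR = 'error'
STATUS_REPLICATION_CHANGE = 'replication_change'


def should_skip_periodic_update(replica_status, active_replica_status):
    """Return True if the periodic replica update should leave this replica alone."""
    # Safeguard 1: a promotion is in flight for this replica.
    if replica_status == STATUS_REPLICATION_CHANGE:
        return True
    # Safeguard 2: both sides are in ERROR, which suggests a failed critical
    # operation that needs manual intervention rather than automatic repair.
    if (replica_status == STATUS_ERROR
            and active_replica_status == STATUS_ERROR):
        return True
    return False
```

A replica in ERROR whose active replica is still healthy is not skipped, so the ordinary recovery path keeps working for that case.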

Change-Id: I3c7d9b2e8f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c
Signed-off-by: Maurice Escher <maurice.escher@sap.com>
@Carthaca force-pushed the netapp-wait-for-rw-after-snapmirror-break branch from c314414 to 0a1f256 on March 20, 2026 13:30
@Carthaca marked this pull request as ready for review on March 20, 2026 13:46
@Carthaca changed the title from "WIP: NetApp: wait for volume to become RW after snapmirror break" to "NetApp: wait for volume to become RW after snapmirror break" on Mar 20, 2026